Online clustering and citation analysis using Streemer

Author

  • Vasileios Kandylas
Abstract

Clustering algorithms can be viewed as following an algorithmic or a probabilistic approach. Algorithmic methods such as k-means or streaming clustering are fast and simple but tend to be ad hoc and hence hard to customize to particular problems, whereas probabilistic methods are more flexible but slower. In this work we propose online algorithms that combine the advantages of the two classes of approaches, giving fast, scalable clustering while allowing more flexible models of the data, such as foreground clusters interspersed within a diffuse background. These clusters are shown to be useful in modeling scientific citations.

We start the thesis by giving a non-probabilistic, few-pass algorithm called Streemer. Streemer uses thresholds on similarities between points to find a large number of clusters on the first pass over the data. It then merges them to find larger and more cohesive clusters. In a final pass it assigns points to the clusters or to a diffuse background. Streemer avoids the standard k-means assumption that clusters are of similar sizes. We also discuss the nature of the objective function that Streemer optimizes through its several steps and heuristics. At a cursory glance, Streemer appears to be an ad hoc algorithm, but in a subsequent chapter we develop a principled algorithm that emulates Streemer's steps and we make the connection between Streemer and online Dirichlet Process Mixture Models.

We use Streemer to cluster documents based on the documents they cite and find "knowledge communities" of authors that build on each other's work. The evolution of these clusters over time gives us insight into their growth or shrinkage. We also build predictive models with features based on the citation structure, the vocabulary of the papers, and the affiliations and prestige of the authors, and use these models to study the drivers of community growth and the predictors of how widely a paper will be cited. The analysis shows that scientific knowledge communities tend to grow more rapidly if their publications build on diverse information and use narrow vocabulary, and that papers that lie on the periphery of a community have the highest impact, while those not in any community have the lowest impact.

We also present a probabilistic mixture model with a Dirichlet Process prior and Gaussian component distributions. This model allows for variable cluster numbers and sizes. We show how to use this model for clustering in an online fashion and also propose a two-pass algorithm, where the first pass clusters points into many clusters and the second pass clusters the output of the first pass. With the exception of foreground/background clustering, the model with the two-pass algorithm corresponds closely to Streemer.

Finally, we present an EM-based clustering method that can simultaneously cluster two or more variables using one or more tables of co-occurrence data. One application of this multi-way clustering algorithm is constructing or augmenting ontologies. We test our algorithm by simultaneously clustering verbs and nouns using both verb-noun and noun-noun co-occurrence pairs. This strategy provides greater coverage of words than using either set of pairs alone, since not all words appear in both datasets. We demonstrate it on data extracted from Medline and evaluate the results using MeSH and WordNet.

Degree Type: Dissertation
Degree Name: Doctor of Philosophy (PhD)
Graduate Group: Computer and Information Science
First Advisor: Lyle H. Ungar
This dissertation is available at ScholarlyCommons: http://repository.upenn.edu/edissertations/59
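To make the few-pass procedure summarized in the abstract more concrete, the following is a minimal Python sketch of a threshold-based clusterer with a background class. It is not the thesis's implementation. The cosine similarity measure, the three thresholds (`t_seed`, `t_merge`, `t_assign`), the `min_size` filter, and all function names are assumptions introduced here for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def few_pass_cluster(points, t_seed=0.9, t_merge=0.8, t_assign=0.7, min_size=2):
    """Hypothetical few-pass clustering; returns one label per point, -1 = background."""
    # Pass 1: build many tight clusters. Each cluster is kept as a running
    # vector sum; cosine to the sum equals cosine to the centroid.
    sums, counts = [], []
    for p in points:
        best, best_sim = None, t_seed
        for i, s in enumerate(sums):
            sim = cosine(p, s)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            sums.append(list(p))
            counts.append(1)
        else:
            sums[best] = [x + y for x, y in zip(sums[best], p)]
            counts[best] += 1

    # Merge step: greedily merge any pair of clusters that is similar enough,
    # restarting the scan after every merge.
    merged = True
    while merged:
        merged = False
        for i in range(len(sums)):
            for j in range(i + 1, len(sums)):
                if cosine(sums[i], sums[j]) >= t_merge:
                    sums[i] = [x + y for x, y in zip(sums[i], sums[j])]
                    counts[i] += counts[j]
                    del sums[j], counts[j]
                    merged = True
                    break
            if merged:
                break

    # Assumption: clusters that stay very small are treated as background noise.
    kept = [s for s, c in zip(sums, counts) if c >= min_size]

    # Final pass: assign each point to its most similar kept cluster,
    # or to the diffuse background (-1) if nothing is similar enough.
    labels = []
    for p in points:
        best, best_sim = -1, t_assign
        for i, s in enumerate(kept):
            sim = cosine(p, s)
            if sim >= best_sim:
                best, best_sim = i, sim
        labels.append(best)
    return labels


if __name__ == "__main__":
    demo = [
        [1, 0, 0], [0.95, 0.05, 0], [0.9, 0.1, 0],   # first tight group
        [0, 1, 0], [0.05, 0.95, 0], [0.1, 0.9, 0],   # second tight group
        [0, 0, 1],                                    # isolated point
    ]
    print(few_pass_cluster(demo))
```

In this sketch nothing forces clusters to have similar sizes, and a point that is far from every retained cluster is left in the background rather than forced into its nearest cluster. For citation data, each vector could for instance be an indicator of which documents a paper cites, though that encoding is also an assumption here.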

Similar resources

The online attention to certain nuclear medicine topics: An altmetrics study vs. a citation analysis

Introduction: Traditional citation analysis has been greatly criticized because the process of citation accumulation requires considerable time after publication. So, the term “altmetrics” was proposed in 2010 to measure the scientific and social impact of a paper. We performed a search for certain nuclear medicine topics using the altmetrics approach to report the correlation b...

Mapping the research landscape in scientometrics and related metric fields

Among the prevalent topics in Knowledge and Information Science, scientometric studies are of special interest. Applying co-citation analysis, this study investigated the landscape of research in scientometrics and related metric areas and revealed its fundamental themes. The initial data of this study (including scientometrics-related documents) have been extracted from the Web of ...

Three-Tier Clustering: An Online Citation Clustering System

In this paper, we present an online citation entry clustering system based on three-tier clustering. The objective is to further process search results returned by bibliography databases and present the user with more accurate results. In our approach, a user first issues an author name query, which is passed to a data source chosen by the user. We then exploit the unique usage of each citat...

BotOnus: an online unsupervised method for Botnet detection

Botnets are recognized as one of the most dangerous threats to the Internet infrastructure. They are used for malicious activities such as launching distributed denial of service attacks, sending spam, and leaking personal information. Existing botnet detection methods contribute a number of good ideas, but they are still far from complete, since most of them cannot detect botnets at an early stage ...

Supporting online material for Mapping change in large networks

Here we lay out the details of how we generate significance clusters and alluvial diagrams for mapping change in networks. Because this method assesses how much confidence we should have in the clustering of a network, we can detect, highlight, and simplify the significant structural changes that occur over time or between states in large networks, for example, citation networks, traffic networ...


Publication date: 2009